Reproducible Research is Research Software Engineering

David Mawdsley, Robert Haines and Caroline Jay

RSE 2017 Conference

Caroline’s part

  • Set background etc.

Self-contained reproducible research papers

  • We can write our paper in LaTeX or Markdown, including R code as required
  • Everything is in R; only external dependency is the data
    • In principle (pretty much) a solved problem
  • Some friction points:
    • software versions → Docker or Packrat
    • formatting, e.g. tables
    • collaboration; e.g. working with Overleaf

What if you can’t do everything in R?

  • Complex dependencies
  • Time-consuming analyses
  • Long pipelines

Our approach

  • Make the pipeline modular by containerising each step using Docker
  • Reusable, reproducible
  • The final module makes the paper
  • Join the outputs of the containers with a Makefile
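
A minimal sketch of the idea behind joining containerised steps: each step declares which earlier steps it depends on, and the steps run in dependency order, with the paper built last. The slides use a Makefile for this; the Python below only illustrates the ordering logic and is not the authors' implementation. The step names, the `example/<step>` image names and the `/data` mount point are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the steps whose outputs it needs.
PIPELINE = {
    "tracking": [],
    "features": ["tracking"],
    "analysis": ["features"],
    "paper": ["analysis"],
}


def run_order(pipeline):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


def docker_command(step, data_dir="/tmp/pipeline-data"):
    """Build (but do not execute) a `docker run` command for one step.

    The image name `example/<step>` and the `/data` mount point are
    placeholders chosen for this sketch.
    """
    return ["docker", "run", "--rm", "-v", f"{data_dir}:/data", f"example/{step}"]


if __name__ == "__main__":
    # Print the commands in the order they would need to run.
    for step in run_order(PIPELINE):
        print(" ".join(docker_command(step)))
```

Sharing one data directory between containers is one simple way to pass outputs from step to step; a Makefile adds the ability to skip steps whose outputs are already up to date.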

Example - IDInteraction

  • Automate the coding of behaviours
  • Coding behaviours by hand is really slow and tedious

Docker images

  • Each module contains its own Makefile
  • Example: object tracking

Challenges

  • Additional complexity
    • Pipeline can be difficult to debug
  • Requires Docker
  • Top-level Makefile can become unwieldy.
    • Investigating workflow languages as an alternative.

Benefits

  • Transparency
  • Allows others to re-run and extend our analyses
    • Re-run, to verify
    • Re-run, to modify (e.g. bounding box)
    • Re-run, to extend (e.g. new data, new tracking methods)
  • Moves away from the static “one-shot” publication
  • RSE is an integral part of the paper production process → appropriate credit
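
The "re-run, to modify" case can be sketched as a dependency question: if one input changes (for example the bounding box), which steps are downstream of it and need to run again? Make answers this from file timestamps; the Python below is a simplified stand-in that walks an explicit dependency graph. The step and input names are hypothetical and are not taken from the actual pipeline.

```python
# Hypothetical dependency graph: each step lists the inputs or steps it reads.
DEPENDS_ON = {
    "tracking": ["video", "bounding_box"],
    "features": ["tracking"],
    "analysis": ["features", "ground_truth"],
    "paper": ["analysis"],
}


def steps_to_rerun(changed, depends_on):
    """Return every step that depends, directly or indirectly, on `changed`."""
    stale = set()
    frontier = {changed}
    while frontier:
        # Steps that read anything in the current frontier are out of date.
        frontier = {
            step
            for step, inputs in depends_on.items()
            if step not in stale and frontier.intersection(inputs)
        }
        stale |= frontier
    return stale


if __name__ == "__main__":
    print(sorted(steps_to_rerun("bounding_box", DEPENDS_ON)))
    print(sorted(steps_to_rerun("ground_truth", DEPENDS_ON)))
```

In this sketch, moving the bounding box invalidates every step through to the paper, while changing a later input leaves the early (and typically slow) steps untouched.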

Caroline’s wrap up

  • RSEs get credit for the paper, since they produce a significant part of it
  • This model lets us update results more easily, but requires rethinking the publication model
  • This is worth doing since it allows more effective and efficient science, which will advance knowledge more quickly

Caroline’s wrap up (2)

  • RSEs involved
  • Aim is to do science well.
  • The issue of software credit (largely) goes away
  • Efficiently advancing science
  • But then the issue is the publication model
  • Maximising the gain of this approach requires us to reconsider this
  • But should reconsider it since it allows us to do our science more efficiently and better
  • Something that’s going to be enabled by moving away from the traditional physical paper approach.
  • Doesn’t replace paper narrative.
  • A different end product from a Jupyter notebook

(Notes: Finish the technical bit with a description of what this allows other people to do: they can re-run some of our analysis. This is how we’ve made the paper; what does this mean for another researcher who comes to this work? They can re-run these aspects completely. But the bbox is placed by a human, so someone could explore how variation in the placement of the bbox affects the results, then re-run and compare. That is a new experiment, using our paper and updating our paper: another potential novel contribution from the same tool chain.
Don’t need to talk about the value of containerisation - talk more about the value of making the paper.
Under the current publishing model - the publication is static. One of the potential benefits is moving away from one-shot papers.

Gives us extra stuff, but how many reviewers will build the pipeline? Probably not many. The value is in improving the (incremental) research process; it causes us to reflect on the publication model. What we can’t do with (most) standard journals is publish an almost identical-looking paper, but with a different dataset.

Doing this because it’s the right thing to do, but what’s the value in it? It is not possible to build these papers without RSEs. Old model: one thinks of something, writes software, stops, then writes the paper. This approach blends these together; there is no gap between software and paper, since they are the same process. The paper is almost software.

)